Abstract



This paper addresses the problem of approximate MAP-MRF inference in general 
graphical models. Following [36], we consider a family of linear programming
relaxations of the problem where each relaxation is specified by a set of nested
pairs of factors for which the marginalization constraint needs to be enforced. We
develop a generalization of the TRW-S algorithm [9] for this problem, where we
use a decomposition into junction chains, monotonic w.r.t. some ordering on the
nodes. This generalizes the monotonic chains in [9] in a natural way. We also
show how to deal with nested factors in an efficient way. Experiments show an 
improvement over min-sum diffusion, MPLP and subgradient ascent algorithms 
on a number of computer vision and natural language processing problems. 

1 Introduction 

This paper is devoted to the problem of minimizing a function of discrete variables represented as 
a sum of factors, where a factor is a term depending on a certain subset of variables. The problem 
is also known as MAP-MRF inference in a graphical model. Due to the generality of the definition, 
it has applications in many areas. Probably, the most well-studied case is when each factor depends 
on at most two variables (pairwise MRFs). Many inference algorithms have been proposed. One 
prominent approach is to try to solve a natural linear programming (LP) relaxation of the problem, 
sometimes called Schlesinger LP [35]. A lot of research went into developing efficient solvers for
this special LP; some examples are llMl [9] IBl [Tl [5] [8] [20] [27] 11 [30] fBl [TTl [22].

A similar LP can also be formulated for higher-order MRFs. In fact, this can be done in many ways. 
We follow the formalism of [36], who describes a family of LP relaxations specified by a set of pairs
of nested factors for which the marginalization constraint needs to be enforced. This approach can 
also be used for pairwise MRFs: we can obtain a hierarchy of progressively tighter relaxations by 
(i) grouping some pairwise factors into larger factors (or introducing higher-order factors with zero 
cost functions), and (ii) formulating an LP for the resulting higher-order MRF. This hierarchy covers 
the Sherali-Adams hierarchy but gives a finer control over the relaxation (see [26]). 

Contributions We present a new algorithm for solving the relaxation discussed above. It builds 
on the sequential tree-reweighted message passing (TRW-S) algorithm of [9] (which in turn builds
on [34]). TRW-S showed good performance for pairwise MRFs [29, 30, 22], so generalizing
it to higher-order MRFs is a natural direction. While developing such a generalization, we had 
to overcome some technical difficulties such as finding the right definition for monotonic junction 
chains and deciding how to deal with nested factors. 

Related work A general framework for obtaining convergent algorithms called tree-consistency 
bound optimization (TBCO) was proposed in [16]. It covers many existing techniques (such as MSD
and MPLP), as well as ours. However, the authors of [16] did not propose any specific choices for
the case of higher-order factors, restricting their experiments to 4-connected grids. The efficiency
of computing min-marginals was also not considered. In contrast, the focus of our paper is on
investigating which choices lead to more efficient techniques in practice. Note that monotonicity for the
higher-order case was not mentioned in [16].

Another related technique is the min-sum diffusion (MSD) algorithm [36]. It can be shown that MSD and
our algorithm have similar theoretical properties: both monotonically increase a lower bound on the function
and are characterized by similar stopping criteria. Note that they are not guaranteed to solve the LP
exactly; they may get stuck in a suboptimal point [9, 35]. Other techniques with such properties
(formulated for restricted cases) include MPLP [25, 27] and the method in [14]: they address the
problem of tightening Schlesinger LP for pairwise MRFs. [6] considered the case of factor graphs,
or relaxations with singleton separators.

A lot of research went into developing algorithms that are guaranteed to converge to an optimal 
solution of the LP. Examples include subgradient ascent techniques ([10, 11]), proximal projections
([20]), Nesterov schemes ([8, 21]), an augmented Lagrangian method [13] [iTl, and the technique
in [22] described as the "smoothed version of TRW-S". According to [22], the latter outperforms
many other techniques on the stereo problem.

Our results in Section 5 indicate that TRW-S generally outperforms other popular techniques that we
tested, namely MSD, MPLP and subgradient ascent.



2 Background and notation 

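We will closely follow the notation of [36]. Let $V$ be the set of nodes. For each node $v \in V$ let $\mathcal{X}_v$
be the finite set of possible labels for $v$, and $\mathcal{X} = \otimes_{v \in V} \mathcal{X}_v$ be the set of labelings of $V$. Our goal
will be to minimize the function

$$f(x \mid \theta) = \sum_{A \in \mathcal{F}} \theta_A(x_A), \qquad x \in \mathcal{X} \qquad (1)$$

where $\mathcal{F} \subseteq 2^V$ is a set of non-empty subsets of $V$ (also called factors), $x_A$ is the restriction of $x$ to
$A \subseteq V$, and $\theta$ is a vector with components $(\theta_A(x_A) \mid A \in \mathcal{F},\ x_A \in \otimes_{v \in A} \mathcal{X}_v)$.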
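
As a concrete reading of definition (1), the following minimal sketch evaluates $f(x \mid \theta)$ on a toy instance and minimizes it by exhaustive search. The dictionary-based data layout, the function names and the cost values are illustrative assumptions, not part of the method described in this paper.

```python
from itertools import product

def energy(x, theta):
    """Evaluate f(x | theta) = sum over factors A of theta_A(x_A).

    x     : dict mapping node -> label
    theta : dict mapping a factor (tuple of nodes) to a table
            {tuple of labels of those nodes: cost}
    """
    return sum(table[tuple(x[v] for v in A)] for A, table in theta.items())

def brute_force_min(labels, theta):
    """Exhaustive minimization; only feasible for tiny toy instances."""
    nodes = sorted(labels)
    best = None
    for assignment in product(*(labels[v] for v in nodes)):
        x = dict(zip(nodes, assignment))
        value = energy(x, theta)
        if best is None or value < best[0]:
            best = (value, x)
    return best

# Hypothetical toy instance: two binary nodes, two unary factors, one pairwise factor.
labels = {"a": [0, 1], "b": [0, 1]}
theta = {
    ("a",): {(0,): 0.0, (1,): 1.0},
    ("b",): {(0,): 0.5, (1,): 0.0},
    ("a", "b"): {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.0},
}
best_value, best_x = brute_force_min(labels, theta)
```

Exhaustive search is exponential in $|V|$; it serves here only to make the objective concrete.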

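Let $\mathcal{J}$ be a fixed set of pairs of the form $(A, B)$ where $A, B \in \mathcal{F}$ and $B \subset A$. Note that $(\mathcal{F}, \mathcal{J})$ is a
directed acyclic graph. We will be interested in solving the following relaxation of the problem:

$$\min_{\mu \in \mathcal{L}(\mathcal{J})} \sum_{A \in \mathcal{F}} \sum_{x_A} \mu_A(x_A)\, \theta_A(x_A) \qquad (2)$$

where $\mathcal{L}(\mathcal{J})$ is the $\mathcal{J}$-based local polytope of $(V, \mathcal{F})$:

$$\mathcal{L}(\mathcal{J}) = \left\{ \mu \ge 0 \;\middle|\; \begin{array}{ll} \sum_{x_A} \mu_A(x_A) = 1 & \forall A \in \mathcal{F} \\ \sum_{x_A \sim x_B} \mu_A(x_A) = \mu_B(x_B) & \forall (A, B) \in \mathcal{J},\ x_B \end{array} \right\} \qquad (3)$$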

Here and below we use the following implicit restriction convention: for $B \subseteq A$, whenever symbols
$x_A$ and $x_B$ appear in a single expression they do not denote independent joint states; instead, $x_B$ denotes
the restriction of $x_A$ to nodes in $B$.

As an example, one could define $\mathcal{J} = \{(A, \{v\}) \mid A \in \mathcal{F}, v \in A\}$; graph $(\mathcal{F}, \mathcal{J})$ is then known as a
factor graph. It can be shown that the resulting relaxation is tight if each term $\theta_A$ is a submodular
function [36], but for non-submodular functions we may need to add extra edges to $\mathcal{J}$ to tighten the
relaxation. Note that in general the conditions $A, B \in \mathcal{F}$, $B \subset A$ don't imply that $(A, B) \in \mathcal{J}$. Requiring
the latter would be unreasonable; if, for example, $|A|, |B| \gg 1$ then adding edge $(A, B)$ to $\mathcal{J}$ would
lead to a relaxation which is computationally infeasible to solve.

Proposition 2.1. The following two operations do not affect the set $\mathcal{L}(\mathcal{J})$, and thus relaxation (2):

• pick edges $(A, B), (B, C) \in \mathcal{J}$, add $(A, C)$ to $\mathcal{J}$. (4a)

• pick edges $(A, B), (A, C) \in \mathcal{J}$ with $B \supset C$, add $(B, C)$ to $\mathcal{J}$. (4b)

A proof is given in Appendix A. We denote by $\bar{\mathcal{J}}$ the closure of $\mathcal{J}$ with respect to these operations; in
other words, $\bar{\mathcal{J}}$ is obtained from $\mathcal{J}$ by applying operations (4) while possible. We have $\mathcal{L}(\bar{\mathcal{J}}) = \mathcal{L}(\mathcal{J})$.

We mention that taking the closure will not cost us anything: each pass of our final Algorithm 3 will
use at most one message operation per factor in $\mathcal{F}$. Using $\bar{\mathcal{J}}$ will be quite important; for example, it
will allow us to extend an ordering on nodes to an ordering on factors in a consistent way.

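Reparameterization and dual problem For each $(A, B) \in \mathcal{J}$ let $m_{AB} = (m_{AB}(x_B))$ be a
message from $A$ to $B$. Each message vector $m = (m_{AB})$ defines a new vector $\bar\theta = \theta[m]$ according to

$$\bar\theta_B(x_B) = \theta_B(x_B) + \sum_{A \mid (A,B) \in \mathcal{J}} m_{AB}(x_B) - \sum_{C \mid (B,C) \in \mathcal{J}} m_{BC}(x_C) \qquad (5)$$

It is easy to check that $\bar\theta$ and $\theta$ define the same objective function, i.e. $f(x \mid \bar\theta) = f(x \mid \theta)$ for all
labelings $x \in \mathcal{X}$. Thus, $\bar\theta$ is a reparameterization of $\theta$ [34]. If $\bar\theta = \theta[m]$ for some vector $m$ then we
will write this as $\bar\theta \equiv \theta$.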

Using the notion of reparameterization, we can write the dual of (2) as follows [36]:

$$\max_{\bar\theta \equiv \theta} \; \sum_{A \in \mathcal{F}} \min_{x_A} \bar\theta_A(x_A) \qquad (6)$$
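
The two facts used here, that a message leaves $f$ unchanged and that it can raise the bound $\sum_A \min_{x_A} \bar\theta_A(x_A)$, can be checked numerically. The sketch below is an illustration under assumptions: the toy costs, the single edge from factor {a, b} to factor {a}, and all function names are hypothetical.

```python
from itertools import product

def restrict(A, x_A, B):
    """Restriction of a labeling x_A of the nodes in A to the nodes in B (B subset of A)."""
    return tuple(x_A[A.index(v)] for v in B)

def energy(x, theta):
    """f(x | theta) as in eq. (1); x maps node -> label."""
    return sum(table[tuple(x[v] for v in A)] for A, table in theta.items())

def reparameterize(theta, messages):
    """Apply eq. (5): each message m_AB is subtracted from theta_A and added to theta_B."""
    theta_bar = {A: dict(table) for A, table in theta.items()}
    for (A, B), m in messages.items():
        for x_A in theta_bar[A]:
            theta_bar[A][x_A] -= m[restrict(A, x_A, B)]
        for x_B in theta_bar[B]:
            theta_bar[B][x_B] += m[x_B]
    return theta_bar

def dual_bound(theta):
    """Objective of the dual (6) for a fixed reparameterization."""
    return sum(min(table.values()) for table in theta.values())

# Hypothetical toy instance with one edge (A, B), A = {a, b}, B = {a}.
A, B = ("a", "b"), ("a",)
theta = {
    B: {(0,): 1.0, (1,): 0.0},
    A: {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 2.0, (1, 1): 2.0},
}
message = {(A, B): {(0,): 0.0, (1,): 2.0}}
theta_bar = reparameterize(theta, message)
labelings = [dict(zip(A, lab)) for lab in product((0, 1), repeat=2)]
```

In this example the bound rises from 0 to 1, which equals the minimum of $f$, while every labeling keeps its energy.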

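Convex combination of subproblems Let $\mathcal{T}$ be a set of subproblem indexes and $\rho : \mathcal{T} \to (0, 1]$ be
a probability distribution on $\mathcal{T}$ with $\sum_{T \in \mathcal{T}} \rho^T = 1$. Each subproblem $T \in \mathcal{T}$ is characterized by the
set of factors $\mathcal{F}_T \subseteq \mathcal{F}$. For factor $A \in \mathcal{F}$ let $\mathcal{T}_A = \{T \in \mathcal{T} \mid A \in \mathcal{F}_T\}$ be the set of subproblems
containing $A$. For each $T \in \mathcal{T}$ we will have vector $\theta^T$ of the same dimension as $\theta$. The collection
of vectors $\theta^T$ will be denoted as $\boldsymbol\theta = (\theta^T \mid T \in \mathcal{T})$. Let $\Omega$ be the following constraint set for $\boldsymbol\theta$:

$$\Omega = \left\{ \boldsymbol\theta \;\middle|\; \begin{array}{l} \theta^T_A(x_A) = 0 \quad \forall T \in \mathcal{T},\ A \in \mathcal{F} - \mathcal{F}_T,\ x_A \\ \sum_{T \in \mathcal{T}} \rho^T \theta^T \equiv \theta \end{array} \right\} \qquad (7)$$

The first condition says that $\theta^T$ must respect the structure of subproblem $T$, while the second condition
means that $\boldsymbol\theta$ is a $\rho$-reparameterization of $\theta$ [34].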

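For a vector $\boldsymbol\theta = (\theta^T \mid T \in \mathcal{T})$ let us define

$$\Phi(\boldsymbol\theta) = \sum_{T \in \mathcal{T}} \rho^T \min_{x \in \mathcal{X}} f(x \mid \theta^T) \qquad (8)$$

Clearly, if $\boldsymbol\theta \in \Omega$ then $\Phi(\boldsymbol\theta)$ is a lower bound on the minimum of function $f(x \mid \theta)$. Our goal will be
to compute vector $\boldsymbol\theta \in \Omega$ that maximizes this bound, i.e. solve the problem

$$\max_{\boldsymbol\theta \in \Omega} \Phi(\boldsymbol\theta) \qquad (9)$$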
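
To illustrate why $\Phi$ is a lower bound, the sketch below splits a small frustrated cycle into two subproblems and compares $\Phi$ with the true minimum. The instance, the weights $\rho^T = 1/2$ and the helper names are illustrative assumptions; the split uses $\theta^T_A = \theta_A / \rho_A$, which satisfies $\sum_T \rho^T \theta^T = \theta$ exactly (a special case of the second condition in (7)).

```python
from itertools import product

NODES = ("a", "b", "c")
LABELS = (0, 1)

def energy(x, theta):
    """f(x | theta); factors absent from theta contribute zero."""
    return sum(table[tuple(x[v] for v in A)] for A, table in theta.items())

def min_energy(theta):
    """Brute-force minimum over all labelings of NODES."""
    return min(energy(dict(zip(NODES, lab)), theta)
               for lab in product(LABELS, repeat=len(NODES)))

# Each pairwise factor charges 1 when its two nodes take the same label,
# so the three factors on a triangle cannot all be satisfied at once.
equal_penalty = {(s, t): float(s == t) for s in LABELS for t in LABELS}
theta = {("a", "b"): equal_penalty, ("b", "c"): equal_penalty, ("a", "c"): equal_penalty}

rho = {"T1": 0.5, "T2": 0.5}
factors_of = {"T1": [("a", "b"), ("b", "c")], "T2": [("a", "c")]}

def split(theta, rho, factors_of):
    """Give each subproblem T the vector theta^T_A = theta_A / rho_A on its own factors."""
    rho_factor = {A: sum(r for T, r in rho.items() if A in factors_of[T]) for A in theta}
    return {T: {A: {x: c / rho_factor[A] for x, c in theta[A].items()}
                for A in factors_of[T]}
            for T in rho}

theta_T = split(theta, rho, factors_of)
phi = sum(rho[T] * min_energy(theta_T[T]) for T in rho)   # eq. (8)
optimum = min_energy(theta)
```

Here each chain can be solved with zero cost, so $\Phi = 0$, while the true minimum is 1; the gap is what the maximization in (9) tries to close.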

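Decomposition into junction trees For a factor $A \in \mathcal{F}$ we denote $\mathcal{F}_A = \{B \in \mathcal{F} \mid (A, B) \in \bar{\mathcal{J}}\} \cup \{A\}$.
We say that factor $A \in \mathcal{F}$ is outer if it has no incoming edges in $(\mathcal{F}, \mathcal{J})$ (or equivalently in $(\mathcal{F}, \bar{\mathcal{J}})$).
The set of outer factors will be denoted as $\mathcal{O} \subseteq \mathcal{F}$. Non-outer factors will be called separators, and
their set will be denoted as $\mathcal{S} = \mathcal{F} - \mathcal{O}$. Finally, for subproblem $T \in \mathcal{T}$ we denote $\mathcal{O}_T = \mathcal{O} \cap \mathcal{F}_T$.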

In this paper we will be interested in decompositions satisfying the following properties: 

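1. There holds $\mathcal{F}_T = \bigcup_{A \in \mathcal{O}_T} \mathcal{F}_A$, i.e. subproblem $T$ is completely specified by its set of outer
factors $\mathcal{O}_T$.

2. There exists a junction tree $(\mathcal{O}_T, \mathcal{E}_T)$, i.e. a tree-structured graph $(\mathcal{O}_T, \mathcal{E}_T)$ with the running
intersection property [3]: for any $A, B \in \mathcal{O}_T$ all factors $C \in \mathcal{O}_T$ on the unique path connecting
$A$ and $B$ satisfy $A \cap B \subseteq C$.

3. For each $(A, B) \in \mathcal{E}_T$ there holds $A \cap B \in \mathcal{F}_A$ and $A \cap B \in \mathcal{F}_B$.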

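In general, conditions $A, B \in \mathcal{F}$, $B \subset A$ don't imply $B \in \mathcal{F}_A$. However, the following holds (see
Appendix B):

Proposition 2.2. If $A, B \in \mathcal{F}_T$ and $B \subset A$ then $B \in \mathcal{F}_A$.

We will slightly restrict the allowed sets $\mathcal{J}$ by assuming

4. If $v \in A \in \mathcal{F}$ then $\{v\} \in \mathcal{F}_A$.

and also allow only one tree per outer factor:

5. There holds $|\mathcal{T}_A| = 1$ for each $A \in \mathcal{O}$.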

The last condition is not really an inherent limitation^1, but it will help to simplify the presentation
of the algorithm. Furthermore, in practice there is no clear reason to cover outer factors more than
once.

^1 If we have a decomposition in which factor $A \in \mathcal{O}$ belongs to several trees, then we can do the following
transformation: add to $V$ new "dummy" nodes $v_T$ for each $T \in \mathcal{T}_A$, add to $\mathcal{F}$ new outer factors $A \cup \{v_T\}$ with
zero cost functions, add to $\mathcal{J}$ edges $(A \cup \{v_T\}, A)$, and finally assign $A \cup \{v_T\}$ to tree $T$.

3 TRW-S algorithm

We will start with a general version of the algorithm for an arbitrary decomposition into junction 
trees, and then present a more specialized version for monotonic chains. 

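We will need the following notation. For tree $T$ and factor $A \in \mathcal{F}_T$ we denote

$$\nu^T_A(x_A) = \sum_{B \in \mathcal{F}_A} \theta^T_B(x_B) \qquad \forall x_A \qquad (10)$$

We say that $\nu^T_A$ gives correct min-marginals for $T$ if

$$\nu^T_A(x_A) = \min_{x \sim x_A} f(x \mid \theta^T) \qquad \forall x_A \qquad (11)$$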

3.1 General version of TRW-S 

The algorithm will rely on two operations: 

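1. Average factor $B \in \mathcal{S}$:

• compute $\nu_B = \left(\sum_{T \in \mathcal{T}_B} \rho^T \nu^T_B\right) / \left(\sum_{T \in \mathcal{T}_B} \rho^T\right)$ (12a)

• update parameters $\theta^T_B$ for $T \in \mathcal{T}_B$ so that we get $\nu^T_B = \nu_B$ for all $T \in \mathcal{T}_B$ (12b)

2. Send message $A \to B$ in $T$ where $A, B \in \mathcal{F}_T$, $(A, B) \in \mathcal{J}$:

• compute $\delta^T(x_B) = \min_{x_A \sim x_B} \nu^T_A(x_A) - \nu^T_B(x_B) \quad \forall x_B$ (13a)

• update $\theta^T_A(x_A) := \theta^T_A(x_A) - \delta^T(x_B) \quad \forall x_A$ and $\theta^T_B(x_B) := \theta^T_B(x_B) + \delta^T(x_B) \quad \forall x_B$ (13b)

Note that after update (13) message $A \to B$ becomes valid in $T$, i.e. there holds
$\min_{x_A \sim x_B} \nu^T_A(x_A) = \nu^T_B(x_B)$ for all $x_B$. This is equivalent to

$$\min_{x_A \sim x_B} \sum_{C \in \mathcal{F}_A - \mathcal{F}_B} \theta^T_C(x_C) = 0 \qquad \forall x_B \qquad (14)$$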
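
The message operation (13) and the validity condition can be traced on a small example. The sketch below works inside a single tree with one outer factor {a, b} and two singleton separators; the table layout, the `sub` map listing $\mathcal{F}_A - \{A\}$, and the numbers are illustrative assumptions.

```python
def restrict(A, x_A, B):
    """Restriction of a labeling x_A of the nodes in A to the nodes in B."""
    return tuple(x_A[A.index(v)] for v in B)

def nu(A, x_A, theta, sub):
    """nu_A(x_A): theta_A plus theta_C for every C in F_A other than A (listed in sub[A])."""
    return theta[A][x_A] + sum(theta[C][restrict(A, x_A, C)] for C in sub.get(A, ()))

def slack(A, B, theta, sub):
    """For every x_B: min over x_A consistent with x_B of nu_A(x_A), minus nu_B(x_B)."""
    return {
        x_B: min(nu(A, x_A, theta, sub)
                 for x_A in theta[A] if restrict(A, x_A, B) == x_B)
             - nu(B, x_B, theta, sub)
        for x_B in theta[B]
    }

def send_message(A, B, theta, sub):
    """Operation (13): compute delta (13a), then move it from theta_A to theta_B (13b)."""
    delta = slack(A, B, theta, sub)
    for x_A in theta[A]:
        theta[A][x_A] -= delta[restrict(A, x_A, B)]
    for x_B in theta[B]:
        theta[B][x_B] += delta[x_B]
    return delta

# Hypothetical toy tree: outer factor {a, b} with singleton separators {a} and {b}.
AB, A1, B1 = ("a", "b"), ("a",), ("b",)
theta = {
    AB: {(0, 0): 1.0, (0, 1): 4.0, (1, 0): 3.0, (1, 1): 0.0},
    A1: {(0,): 0.0, (1,): 2.0},
    B1: {(0,): 1.0, (1,): 0.0},
}
sub = {AB: [A1, B1]}
delta = send_message(AB, B1, theta, sub)
after = slack(AB, B1, theta, sub)
```

After the update the slack is zero for every label of b, which is exactly the statement that the message is valid.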

The TRW-S algorithm simply performs min-marginal averaging operations for factors $B \in \mathcal{S}$:


Algorithm 1 TRW-S

0: initialize $\boldsymbol\theta$ with some vector in $\Omega$
1: repeat until some stopping criterion
2:    for factors $B \in \mathcal{S}$ do in some fixed order that visits each factor in $\mathcal{S}$ at least once
3:       for each $T \in \mathcal{T}_B$ reparameterize $\theta^T$ so that $\nu^T_B$ gives correct min-marginals for $B$ (eq. (11))
4:       average $B$ using eq. (12)
5:    end for
6: end repeat
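
The averaging operation (12) can be sketched in isolation. The code below assumes that the min-marginal vectors $\nu^T_B$ have already been computed for each tree containing $B$, and it returns the amount to add to each $\theta^T_B$; since $\nu^T_B$ contains $\theta^T_B$ as a summand, adding this shift makes $\nu^T_B$ equal to the average. The function names and numbers are hypothetical.

```python
def average(nu_B_by_tree, rho):
    """Eq. (12a): rho-weighted average of the vectors nu_B^T over trees containing B."""
    total = sum(rho[T] for T in nu_B_by_tree)
    labels = next(iter(nu_B_by_tree.values())).keys()
    return {x: sum(rho[T] * nu_B_by_tree[T][x] for T in nu_B_by_tree) / total
            for x in labels}

def averaging_shift(nu_B_by_tree, rho):
    """Eq. (12b): per-tree correction to theta_B^T that makes nu_B^T equal to the average."""
    nu_B = average(nu_B_by_tree, rho)
    return {T: {x: nu_B[x] - nu_T[x] for x in nu_B}
            for T, nu_T in nu_B_by_tree.items()}

# Hypothetical min-marginals of a binary separator B in two trees.
rho = {"T1": 0.25, "T2": 0.75}
nu_B_by_tree = {"T1": {0: 4.0, 1: 0.0}, "T2": {0: 0.0, 1: 4.0}}
nu_B = average(nu_B_by_tree, rho)
shift = averaging_shift(nu_B_by_tree, rho)

# Contribution of these trees to the bound before and after averaging.
bound_before = sum(rho[T] * min(nu_B_by_tree[T].values()) for T in rho)
bound_after = sum(rho.values()) * min(nu_B.values())
```

The $\rho$-weighted sum of the shifts is zero, so $\sum_T \rho^T \theta^T$ is unchanged, and in this example the bound contribution grows from 0 to 1.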



Note that Algorithm 1 is a special case of tree-consistency bound optimization (TBCO) from [16]. We
postpone the analysis of this algorithm until Section 4. One of its properties is the monotonic
behaviour of the lower bound: $\Phi(\boldsymbol\theta)$ never goes down. We also formally prove that the algorithm
is characterized by the same stopping condition as the min-sum diffusion algorithm [36] (up to
reparameterization).

Step 3 of the algorithm requires computing min-marginals for factor $B$ in tree $T \in \mathcal{T}_B$. This can
be done via a junction tree algorithm [3] in two steps as follows. (i) Choose a factor $A \in \mathcal{O}_T$ that
contains $B$; make $A$ the root of tree $(\mathcal{O}_T, \mathcal{E}_T)$. For each directed edge $(C, D) \in \mathcal{E}_T$ oriented toward
$A$ send a message $C \to S$ using eq. (13) where $S = C \cap D$. Do it in the "inward order" that starts
from the leaves. (ii) If $A \ne B$ send a message $A \to B$ using (13).

It is not difficult to see that after step (i) $\nu^T_A$ gives correct min-marginals for $T$. A sketch of the
proof is as follows. After sending message $C \to S$ from a leaf $C$ this message becomes valid, i.e.
(14) holds. This means that removing factors $\{R \in \mathcal{F}_T \mid R \cap (C - S) \ne \varnothing\}$ from $\mathcal{F}_T$ will not affect
min-marginals for the remaining factors. Applying this argument inductively gives the claim.

Figure 1: Example of a chain with three outer factors $X = abc$, $Y = bcd$, $Z = de$. (For brevity,
factors $\{x, y, \ldots, z\}$ are written as $xy \ldots z$.) The order of factors in $\mathcal{S}$ is reflected by
their x-coordinates. (Groups shown in the figure: $\mathcal{S}_X$, $\mathcal{S}_Y$, $\mathcal{S}_Z$.)



3.2 TRW-S with monotonic chains 



Running the junction tree algorithm from scratch every time would be very inefficient if trees are 
large. Fortunately, we can speed up computations by reusing previously passed messages. The 
general idea of not recomputing messages when they would not change has appeared several times 
in the literature in different contexts, e.g. in [12, 18, 9]. To make the most of this idea, we now impose
the following assumption on the decomposition; it will allow computing min-marginals by sending 
messages only from immediate neighbors. 

6. Each tree $(\mathcal{O}_T, \mathcal{E}_T)$ is a monotonic chain w.r.t. some fixed total order $\le$ on $V$, i.e. it
is an ordered sequence of factors $A_1, \ldots, A_k$ such that for each pair of consecutive factors
$(A_i, A_{i+1}) \in \mathcal{E}_T$ intersecting at $S = A_i \cap A_{i+1} \in \mathcal{S}$ there holds

$$u < v < w \qquad \forall u \in A_i - S,\ v \in S,\ w \in A_{i+1} - S \qquad (15)$$
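
Condition (15) is easy to test mechanically. The following sketch checks it for a given sequence of outer factors; it assumes factors are Python sets and the total order is given as a rank map, and it does not verify the other requirements of the decomposition (for instance that each intersection belongs to $\mathcal{S}$). The function name is hypothetical.

```python
def is_monotonic_chain(chain, position):
    """Check condition (15) for consecutive factors of a chain.

    chain    : list of factors A_1, ..., A_k, each a set of nodes
    position : dict node -> rank in the total order on V
    """
    for left, right in zip(chain, chain[1:]):
        sep = left & right
        if not sep:
            return False  # consecutive factors must intersect in a separator
        lo = [position[u] for u in left - sep]
        mid = [position[v] for v in sep]
        hi = [position[w] for w in right - sep]
        if lo and max(lo) >= min(mid):
            return False  # some u in A_i - S does not precede all of S
        if hi and max(mid) >= min(hi):
            return False  # some w in A_{i+1} - S does not follow all of S
    return True

# Chain of Figure 1 with the alphabetical order on nodes.
position = {v: i for i, v in enumerate("abcde")}
chain = [set("abc"), set("bcd"), set("de")]
ok = is_monotonic_chain(chain, position)
reversed_pair = is_monotonic_chain([set("bcd"), set("abc")], position)
```

The chain of Figure 1 passes the check, while listing the first two factors in the opposite order fails it.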

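The total order on factors in $\mathcal{O}_T$ corresponding to chain $T$ will be denoted as $\preceq$. From now on we
will treat $\mathcal{E}_T$ as a directed set of edges that contains pairs $(A, A')$ with $A \prec A'$. It is convenient to
define for factor $A \in \mathcal{O}_T$ "left" and "right" separators as

$$\mathrm{sep}^-A = \begin{cases} \{\min A\} & \text{if } A \text{ is the first factor in } T \\ A' \cap A & \text{if } \exists (A', A) \in \mathcal{E}_T \end{cases} \qquad \mathrm{sep}^+A = \begin{cases} A \cap A' & \text{if } \exists (A, A') \in \mathcal{E}_T \\ \{\max A\} & \text{if } A \text{ is the last factor in } T \end{cases} \qquad (16)$$

Here min and max are taken w.r.t. $\le$; therefore, $\{\min A\}$ and $\{\max A\}$ are singleton separators
in $\mathcal{F}_A$. Note that we dropped the dependence of $\mathrm{sep}^-A$, $\mathrm{sep}^+A$ on $T$ due to Assumption 5.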

Algorithm First, we select an ordering $\preceq$ on $\mathcal{S}$ that extends ordering $\le$, i.e. the following holds:

• if $\min A < \min B$ and $\max A < \max B$ then $A \prec B$;

• if $\max A > \max B$ and $\min A > \min B$ then $A \succ B$.

This can be done in several ways, e.g. by choosing a unique sequence $\sigma_A = (\min A, \max A, \ldots)$
for each $A \in \mathcal{S}$ and then setting $\preceq$ as the lexicographical order on $\sigma_A$ (using $\le$ for comparing
components of $\sigma_A$).

The choice of $\preceq$ will determine the order of averaging operations: the algorithm will alternate
between a forward pass (processing factors in $\mathcal{S}$ in the order $\preceq$) and a backward pass (which uses
the reverse order).
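
One way to realize such an ordering is to sort separators by a key that starts with (min, max). In the sketch below the remaining components of the key are the sorted node ranks, which is an assumption made here only to break ties uniquely; the set of separators is also a hypothetical choice for the chain of Figure 1.

```python
def sigma(factor, position):
    """Key (min A, max A, ...) for a factor; the tail is the sorted rank list (an assumption)."""
    ranks = sorted(position[v] for v in factor)
    return (ranks[0], ranks[-1], *ranks)

def order_separators(separators, position):
    """Forward-pass order: lexicographic order on the keys; reverse it for the backward pass."""
    return sorted(separators, key=lambda B: sigma(B, position))

position = {v: i for i, v in enumerate("abcde")}
# Hypothetical separator set: all singletons plus the intersection bc.
separators = [("d",), ("b", "c"), ("a",), ("c",), ("e",), ("b",)]
forward = order_separators(separators, position)
backward = list(reversed(forward))
```

Because distinct factors have distinct sorted rank lists, the keys are unique, and any two factors with smaller min and smaller max are placed in the required order.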

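For a factor $A \in \mathcal{O}$ we define (see Fig. 1)

$$\mathcal{S}_A = \{B \in \mathcal{F}_A \cap \mathcal{S} \mid \mathrm{sep}^-A \preceq B \preceq \mathrm{sep}^+A\} \qquad (17)$$

It is possible to prove the following (see Appendix C):

Proposition 3.1. If ordering $\preceq$ extends $\le$ and $T$ is a monotonic chain w.r.t. $\le$ then $\mathcal{F}_T \cap \mathcal{S} = \bigcup_{A \in \mathcal{O}_T} \mathcal{S}_A$.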

We now formulate the TRW-S algorithm. 

Remark 1 It follows from Proposition 3.1 and definition (17) that in step 3 there exists exactly one factor $A$
with stated properties, with one exception: if $B$ is the first factor in $\mathcal{F}_T \cap \mathcal{S}$ (i.e. $B = \mathrm{sep}^-A_1$ where $A_1 \in \mathcal{O}_T$
is the first factor of chain $T$) then no such $A$ exists. Note that we do not send messages $A \to \mathrm{sep}^-A$ for $A \in \mathcal{O}$
since these messages remain valid from the previous reverse pass (see the analysis in Section 4).

Remark 2 As we will show later, sometimes we may speed up message passing operations. Consider the
example in Fig. 1. When passing message $Y \to c$ in the forward pass, we know that message $Y \to bc$ is
valid (from the previous reverse pass); therefore, we can compute increment $\delta^T(x_c)$ in (13a) by going through
labelings $x_{bc}$ rather than through labelings $x_Y$.






Algorithm 2 TRW-S with monotonic chains

0: initialize $\boldsymbol\theta \in \Omega$
1: for each $B \in \mathcal{S}$ do in the order $\preceq$
2:    for each $T \in \mathcal{T}_B$ do
3:       find $A \in \mathcal{O}_T$ with $B \in \mathcal{S}_A$, $B \ne \mathrm{sep}^-A$; if it exists, send message $A \to B$ in $T$ (eq. (13))
4:    end for
5:    average $B$ using (12)
6: end for
7: if a stopping criterion is satisfied, terminate; otherwise reverse the ordering and go to step 1



Now consider message $X \to b$ in the forward pass. Message $X \to bc$ is invalid at this point, so we cannot use
the trick above. However, we can instead "preemptively" compute message $X \to bc$ (without reparameterizing
anything), and then use it both for $b$ and $bc$. Details are given in the next section.

3.3 Implementation via messages 

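It is easy to see that each step of Algorithm 2 preserves property $\theta^T_C = \theta^{T'}_C$ for $T, T' \in \mathcal{T}_C$, $C \in \mathcal{F}$
(assuming that it holds after initialization). Therefore, it suffices to store the cumulative vector
$\hat\theta = \sum_T \rho^T \theta^T$; components of vector $\boldsymbol\theta = (\theta^T)$ are then given by

$$\theta^T_C = \frac{1}{\rho_C} \hat\theta_C \qquad \forall C \in \mathcal{F},\ T \in \mathcal{T}_C \qquad (18)$$

where $\rho_C = \sum_{T \in \mathcal{T}_C} \rho^T$ is the factor appearance probability. By construction, vector $\hat\theta$ is a
reparameterization of $\theta$ (eq. (5)), so we can store it via messages $m = (m_{AB} \mid (A, B) \in \hat{\mathcal{J}})$ where
$\hat{\mathcal{J}} = \{(A, B) \mid A \in \mathcal{O}, B \in \mathcal{S}_A\}$. We thus have

$$\hat\theta_A(x_A) = \theta_A(x_A) - \sum_{(A,B) \in \hat{\mathcal{J}}} m_{AB}(x_B) \qquad \forall A \in \mathcal{O} \qquad (19a)$$

$$\hat\theta_B(x_B) = \theta_B(x_B) + \sum_{(A,B) \in \hat{\mathcal{J}}} m_{AB}(x_B) \qquad \forall B \in \mathcal{S} \qquad (19b)$$

For efficiency reasons we will also store vectors $\hat\theta_B$ for $B \in \mathcal{S}$ explicitly, so that we don't need to
recompute them from $m$ every time. The resulting algorithm is given below.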



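Algorithm 3 TRW-S with monotonic chains

0: set $m_{AB} := 0 \ \ \forall (A, B) \in \hat{\mathcal{J}}$ and $\hat\theta_B := \theta_B \ \ \forall B \in \mathcal{S}$
1: for each $B \in \mathcal{S}$ do in the order $\preceq$
2:    set $\hat\theta_B := \theta_B$
3:    for each $(A, B) \in \hat{\mathcal{J}}$ do
4:       if $B \ne \mathrm{sep}^-A$ then
5:          update
            $$m_{AB}(x_B) := \min_{x_A \sim x_B} \Big[ \theta_A(x_A) - \sum_{(A,C) \in \hat{\mathcal{J}},\ C \ne B} m_{AC}(x_C) + \sum_{C \in (\mathcal{F}_A \cap \mathcal{S}) - \mathcal{F}_B} \frac{\rho_A}{\rho_C}\, \hat\theta_C(x_C) \Big] \qquad (20)$$
6:          compute $\gamma = \min_{x_B} m_{AB}(x_B)$, update $m_{AB}(x_B) \mathrel{-}= \gamma$ /* optional: for numerical stability */
7:       end if
8:       update $\hat\theta_B \mathrel{+}= m_{AB}$
9:    end for
10: end for
11: if a stopping criterion is satisfied, terminate; otherwise reverse the ordering and go to step 1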



Reusing messages in nested factors Suppose that we have two factors $P, B \in \mathcal{S}_A$, $A \in \mathcal{O}$ with
$B \subset P$ such that $B$ is processed immediately after $P$ in chain $T \in \mathcal{T}_A$, i.e. there are no other
factors in $\mathcal{S}_A$ between $P$ and $B$. When processing edge $(A, B)$, we know that $(A, P)$ contains a
valid message. This allows us to speed up the computation of the message from $A$ to $B$. Namely, we
need to perform the update $m_{AB} \mathrel{+}= \rho_A \delta^T$ where $\delta^T(x_B) = \min_{x_A \sim x_B} \nu^T_A(x_A) - \nu^T_B(x_B) =
\min_{x_P \sim x_B} \nu^T_P(x_P) - \nu^T_B(x_B)$. Thus, the update in step 5 can be replaced by the equivalent update

$$m_{AB}(x_B) \mathrel{+}= \min_{x_P \sim x_B} \sum_{C \in \mathcal{F}_P - \mathcal{F}_B} \frac{\rho_A}{\rho_C}\, \hat\theta_C(x_C)$$



Now suppose that $P, B \in \mathcal{S}_A$, $A \in \mathcal{O}$, $B \subset P$ and $B$ is processed immediately before $P$, i.e. there
are no other factors in $\mathcal{S}_A$ between $B$ and $P$. In that case we can replace step 5 for factor $B$ with the
following:

(a) set $m^\circ_{AP} := m_{AP}$, update $m_{AP}$ as in step 5 (where $B$ is replaced with $P$), set $\delta_{AP} := m_{AP} - m^\circ_{AP}$

(b) compute

$$\delta(x_B) = \min_{x_P \sim x_B} \Big[ \delta_{AP}(x_P) + \sum_{C \in \mathcal{F}_P - \mathcal{F}_B} \frac{\rho_A}{\rho_C}\, \hat\theta_C(x_C) \Big]$$

(c) update $m_{AB} \mathrel{+}= \delta$ and $m_{AP}(x_P) \mathrel{-}= \delta(x_B)$

It can be checked that (i) the resulting message $m_{AB}$ is the same as the one that would be computed
in step 5; (ii) when passing message $A \to P$ (during the averaging step for $P$), the update in step 5
would not change $m_{AP}$. Thus, the latter update can be skipped (though the normalization step 6 still
needs to be applied). Note that in operations (a)-(c) we modify $m_{AP}$ but do not change $\hat\theta_P$; therefore
equality (19b) for factor $P$ temporarily becomes violated (but gets restored after processing $P$).



4 Algorithm's analysis

We will first analyze the general version of TRW-S (Algorithm 1). We will then show that after the
first forward pass Algorithm 2 is a special case of Algorithm 1: during the averaging step 5 vectors
$\nu^T_B$ give correct min-marginals for trees $T \in \mathcal{T}_B$.



4.1 Analysis of Algorithm 1

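We will need a few definitions. Consider subset $A \subseteq V$ and a vector $\varphi_A$ with components $\varphi_A(x_A)$.
We define relation $\langle \varphi_A \rangle \subseteq \otimes_{v \in A} \mathcal{X}_v$ as

$$\langle \varphi_A \rangle = \{x_A \mid \varphi_A(x_A) = \min_{x'_A} \varphi_A(x'_A)\} \qquad (21)$$

For a tree $T \in \mathcal{T}$ define vector $\nu^T$ with components $(\nu^T(x) \mid x \in \mathcal{X})$ via

$$\nu^T(x) = f(x \mid \theta^T) = \sum_{B \in \mathcal{F}_T} \theta^T_B(x_B) \qquad (22)$$

This can be viewed as a generalization of definition (10). We emphasize that vectors $\nu^T$ and $\nu^T_A$ for
$A \in \mathcal{F}_T$ are uniquely determined by vector $\theta^T$ via a linear transformation.

A projection of relation $\mathcal{R} \subseteq \otimes_{v \in A} \mathcal{X}_v$ to subset $B \subseteq A$ is defined as

$$\pi_B(\mathcal{R}) = \{x_B \mid x_A \in \mathcal{R}\} \qquad (23)$$

(Recall that $x_B$ is the restriction of labeling $x_A$ to $B$.)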
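
Definitions (21) and (23) translate directly into set operations. The sketch below represents a vector over labelings of $A$ as a dictionary and a relation as a set of label tuples; the names and the toy values are illustrative assumptions.

```python
def minimizers(phi, tol=0.0):
    """Relation <phi> of eq. (21): labelings attaining the minimum (up to tol)."""
    best = min(phi.values())
    return {x for x, value in phi.items() if value <= best + tol}

def project(relation, A, B):
    """Projection pi_B of eq. (23): restrict every labeling of A in the relation to B."""
    idx = [A.index(v) for v in B]
    return {tuple(x[i] for i in idx) for x in relation}

# Hypothetical vector on A = {a, b} with two minimizers.
A = ("a", "b")
phi = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
R = minimizers(phi)
R_a = project(R, A, ("a",))
R_b = project(R, A, ("b",))
```

Both projections equal the full label set {0, 1} here even though only two of the four joint labelings are optimal, which is why agreement on projections is a weaker condition than agreement on joint minimizers.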

Weak tree agreement We now define a condition characterizing a stopping criterion for TRW-S.

Definition 4.1. Vector $\boldsymbol\theta = (\theta^T \mid T \in \mathcal{T})$ is said to satisfy the enhanced weak tree agreement (EWTA)
condition for factor $B \in \mathcal{F}$ if $\pi_B(\langle \nu^T \rangle) = \pi_B(\langle \nu^{T'} \rangle)$ for $T, T' \in \mathcal{T}_B$.

It satisfies the weak tree agreement (WTA) for $B \in \mathcal{F}$ if there exist non-empty relations
$(\mathcal{R}^T \subseteq \langle \nu^T \rangle \mid T \in \mathcal{T})$ s.t. $\pi_B(\mathcal{R}^T) = \pi_B(\mathcal{R}^{T'})$ for $T, T' \in \mathcal{T}_B$.

Vector $\boldsymbol\theta$ is said to satisfy EWTA (WTA) if it satisfies EWTA (WTA) for all $B \in \mathcal{F}$.
Clearly, EWTA implies WTA (but not the other way around).
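
Theorem 4.2. Let $\boldsymbol\theta$, $\tilde{\boldsymbol\theta}$ be respectively the vectors before and after averaging step 4 for factor $B$.

(a) The lower bound does not decrease: $\Phi(\tilde{\boldsymbol\theta}) \ge \Phi(\boldsymbol\theta)$.

(b) If $\boldsymbol\theta$ satisfies WTA for $B$ with relations $(\mathcal{R}^T \mid T \in \mathcal{T})$, then $\tilde{\boldsymbol\theta}$ also satisfies WTA with the same
set of relations. Furthermore, $\Phi(\tilde{\boldsymbol\theta}) = \Phi(\boldsymbol\theta)$.

(c) If $\Phi(\tilde{\boldsymbol\theta}) = \Phi(\boldsymbol\theta)$ then $\langle \tilde\nu^T \rangle \subseteq \langle \nu^T \rangle$ for each $T \in \mathcal{T}$.

(d) If $\Phi(\tilde{\boldsymbol\theta}) = \Phi(\boldsymbol\theta)$ and $\boldsymbol\theta$ does not satisfy
EWTA for $B$ then $\langle \tilde\nu^T \rangle \subset \langle \nu^T \rangle$ for at least one tree $T \in \mathcal{T}_B$.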

A proof is given in Appendix D. 
Corollary 4.3.

• If $\boldsymbol\theta$ satisfies WTA then Algorithm 1 will not increase the lower bound $\Phi(\boldsymbol\theta)$, and furthermore after
a finite number of steps $\boldsymbol\theta$ will satisfy EWTA.

• If $\boldsymbol\theta$ does not satisfy WTA then bound $\Phi(\boldsymbol\theta)$ will increase after a finite number of steps.

Proof. The first claim follows from parts (b,d) of Theorem 4.2. To prove the second claim, assume
that $\Phi(\boldsymbol\theta)$ stays constant after an arbitrary number of steps. From parts (c,d) we conclude that after
a finite number of steps we get vector $\tilde{\boldsymbol\theta}$ satisfying EWTA such that $\langle \tilde\nu^T \rangle \subseteq \langle \nu^T \rangle$ for all $T$. This
means that $\boldsymbol\theta$ satisfies WTA with relations $\mathcal{R}^T = \langle \tilde\nu^T \rangle$. □

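Relation to min-sum diffusion We now show that the WTA condition is closely related to the stopping
criterion of the MSD algorithm [36]. Recall that MSD tries to maximize lower bound

$$\Psi(\bar\theta) = \sum_{A \in \mathcal{F}} \min_{x_A} \bar\theta_A(x_A) \qquad (24)$$

over vectors $\bar\theta \equiv \theta$. Its stopping criterion is described in the following definition.

Definition 4.4. Vector $\bar\theta \equiv \theta$ is said to satisfy the enhanced $\mathcal{J}$-consistency condition if $\pi_B(\langle \bar\theta_A \rangle) =
\langle \bar\theta_B \rangle$ for each $(A, B) \in \mathcal{J}$. It is said to satisfy the $\mathcal{J}$-consistency condition if there exist non-empty
relations $(\mathcal{R}_B \subseteq \langle \bar\theta_B \rangle \mid B \in \mathcal{F})$ such that $\pi_B(\mathcal{R}_A) = \mathcal{R}_B$ for each $(A, B) \in \mathcal{J}$.

We denote by $\Omega^*$ the set of vectors $\boldsymbol\theta \in \Omega$ that satisfy the WTA condition, and by $\Lambda^*$ the set of
vectors $\bar\theta \equiv \theta$ that satisfy the $\mathcal{J}$-consistency condition.

Theorem 4.5. There exist mappings $\phi : \Omega^* \to \Lambda^*$ and $\psi : \Lambda^* \to \Omega^*$ that preserve the value of the
lower bound, i.e. $\Psi(\phi(\boldsymbol\theta)) = \Phi(\boldsymbol\theta)$ and $\Phi(\psi(\bar\theta)) = \Psi(\bar\theta)$.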

A proof is given in Appendix E. 

4.2 Analysis of Algorithm 2

We now analyze the TRW-S algorithm with monotonic chains. In order to do this, we will reformulate
it slightly. Namely, we will maintain factor $\mathrm{CUR}_T \in \mathcal{O}_T$ for each $T \in \mathcal{T}$ ("current outer factor
of chain $T$") and factor $\mathrm{CHILD}_A \in \mathcal{S}_A$ for each $A \in \mathcal{O}$:

0: initialize $\boldsymbol\theta \in \Omega$
   for each $T \in \mathcal{T}$ set $\mathrm{CUR}_T :=$ first factor of chain $T$
   for each $A \in \mathcal{O}$ set $\mathrm{CHILD}_A := \mathrm{sep}^-A$
1: for each $B \in \mathcal{S}$ do in the order $\preceq$
2:    for each $T \in \mathcal{T}_B$ do
3:       let $A = \mathrm{CUR}_T$
4:       if $\mathrm{CHILD}_A \ne B$ then send message $A \to B$ in $T$ (eq. (13)) and update $\mathrm{CHILD}_A := B$
5:       if $B = \mathrm{sep}^+A$ and $\exists (A, A') \in \mathcal{E}_T$ set $\mathrm{CUR}_T := A'$
6:    end for
7:    average $B$ using (12)
8: end for
9: if a stopping criterion is satisfied, terminate; otherwise reverse the ordering and go to step 1

It should be clear that this algorithm is equivalent to Algorithm 2. In particular, the following is
maintained:

Proposition 4.6. (a) In step 4 there holds $B \in \mathcal{S}_A$.

(b) If $A' \in \mathcal{O}_T$, $A' \prec \mathrm{CUR}_T$ then $\mathrm{CHILD}_{A'} = \mathrm{sep}^+A'$.

(c) If $A' \in \mathcal{O}_T$, $A' \succ \mathrm{CUR}_T$ then $\mathrm{CHILD}_{A'} = \mathrm{sep}^-A'$.



The algorithm's correctness will follow from

Theorem 4.7. (a) Each step of the algorithm preserves the validity of edges $(A, \mathrm{CHILD}_A)$, $A \in \mathcal{O}_T$
in $T$: if the edge contained a valid message in $T$ before the step (eq. (14)), then this message remains
valid afterwards.

(b) After the first forward pass, all edges $(A, \mathrm{CHILD}_A)$, $A \in \mathcal{O}_T$ are valid in $T$. Consequently, in
step 5 vector $\nu^T_B$ gives correct min-marginals in $T$ for each $T \in \mathcal{T}_B$.

Proof. Consider loop 1-8 for factor $B$, and let us fix tree $T \in \mathcal{T}_B$. Let $A$ be the factor defined in
step 3: $A = \mathrm{CUR}_T$. It is clear that sending message $A \to B$ in $T$ makes edge $(A, B)$ valid, and that
averaging $B$ in step 7 preserves the validity of this edge (see eq. (14)).

Now consider factor $A' \in \mathcal{O}_T$, $A' \prec A$, and define $S = \mathrm{CHILD}_{A'} = \mathrm{sep}^+A'$. Let us show that
update of vectors $\theta^T_C$ for $C \in \mathcal{F}_A$ preserves the validity of edge $(A', S)$ in $T$. We need to prove that
$C \notin \mathcal{F}_{A'} - \mathcal{F}_S$ (since the definition of a valid edge involves only vectors $\theta^T_D$ for $D \in \mathcal{F}_{A'} - \mathcal{F}_S$).
Suppose that $C \in \mathcal{F}_{A'}$. By the running intersection property we have $C \subseteq A''$ where $A''$ is the right
neighbor of $A'$, i.e. $(A', A'') \in \mathcal{E}_T$. Therefore, $C \subseteq A' \cap A'' = \mathrm{sep}^+A' = S$, and so $C \in \mathcal{F}_S$ and
$C \notin \mathcal{F}_{A'} - \mathcal{F}_S$, as claimed.

A similar argument can be used for factors $A' \in \mathcal{O}_T$, $A' \succ A$. Part (a) is proved. Part (b) easily
follows from part (a) and the fact that step 4 makes edge $A \to B$ valid in $T$. □



5 Experimental results 

We compare the proposed TRW-S to (our own implementation^2 of) min-sum diffusion (MSD) [36],
MPLP [27] and subgradient ascent methods (SG) [11], the latter with (non-monotonic) chains where
each outer factor belongs to exactly one chain^3. Our current implementation of TRW-S does not support
the second "reuse" scheme described at the end of Section 3.3. Since timings are implementation-dependent
we also report a "message effort measure", where each minimization computation over a
factor of size $n$ contributes $n$. All experiments were run on a 2.5 GHz Core i5 machine.

We evaluate the methods on problems from the fields of computer vision and natural language 
processing: we consider image segmentation with a generalized Potts model with 2x2 blocks, with 
factor-based curvature, with constraint-based curvature and with histogram-based data terms. Also,
we consider stereo disparity estimation with second order differences, and word alignment. For 
stereo there are 8 labels per variable, for the generalized Potts model 4 and for all other problems 2. 

Three of our problems (2x2 block Potts, stereo and factor-based curvature) use factors of low order
only, so they are explored with singleton and pairwise separators (same style of message computation
subroutines for all compared schemes). The remaining problems are of high order (16, 9600
and 5281 resp.). Constraint-based curvature requires handling integer linear constraints, where we
use the method of [19] for the message computations. Histogram image segmentation and word
alignment require cardinality potentials; the latter also uses 1-of-N potentials. We handle this as
in [31] and implemented specialized routines for MSD with these high order terms. Here, MPLP
has an advantage over TRW-S: with the specialized computations it effectively only needs to visit


^2 The code at http://cs.nyu.edu/~dsontag/code/ only supports factors up to size three.

^3 We used the step size rule that resembles the one in [13], namely $\lambda/(K + 1)$ where $K$ is the number of
times an iteration produced an inferior bound. We tried several values of $\lambda$ and chose the one that performs best after
500 iterations (for a given instance). We also tested the step-size rule from [12] for problems in the top row
of Table 1, but it was inferior to our rule. A potential reason is that the rule of [12] depends on the primal
integral solution, and so if e.g. the relaxation is not tight then the gap will always remain large. (Note that in this
case the step size doesn't go to zero, so this rule doesn't guarantee convergence to the optimum.) We mention
that for stereo TRW-S was pretty close to the optimum after 250 iterations, while the primal solution of SG was still
far after 500 iterations.

We also informally tested the step-size rule from [32], but found it to be inferior as well.

Figure 2: Plots of energy vs. message effort for singleton (left) and pair separators (right) for stereo.

Table 1: Singleton separators: relaxation values, timings (in seconds), message effort and memory
for the compared schemes. Timings exclude any time spent on computing the intermediate bounds.
We ran 250 iterations of TRW-S (forward+backward passes) and 500 of all other methods. MPLP
and MSD can probably be sped up at the cost of extra memory. A "*" indicates that the method
converged before the set number of iterations was used up.



each factor once per iteration (as is always the case for the subgradient method). TRW-S needs to
visit each factor multiple times per iteration, so it is much slower. However, immense speedups in
TRW-S should be possible by using advanced data structures. Consider, for example, a cardinality-dependent
factor $A$ with binary labels. Message computation requires sorting certain values for
nodes $v \in A$. Each TRW-S update changes only one of these values, so we can use e.g. 2-3-4 trees
for maintaining a sorted order. This gives $O(\log |A|)$ time per node, same as in the other techniques.
We leave this as future work.
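
The role of sorting can be seen in a simplified setting. The sketch below assumes a factor of the form $g(\sum_v x_v) + \sum_v c_v x_v$ over binary labels, where the $c_v$ stand for the per-node values that would come from incoming messages; this form, the function names and the numbers are assumptions made for illustration, and the code computes only the overall minimum, not the per-node messages or the incremental update via balanced trees mentioned above.

```python
from itertools import accumulate, product

def min_cardinality_factor(g, c):
    """Minimize g[sum(x)] + sum_v c[v] * x[v] over binary x by sorting the values c.

    For a fixed count k the best choice switches on the k smallest values,
    so one sort plus prefix sums is enough.
    """
    prefix = [0.0, *accumulate(sorted(c))]
    return min(g[k] + prefix[k] for k in range(len(c) + 1))

def brute_force(g, c):
    """Reference implementation enumerating all 2^n binary labelings."""
    return min(g[sum(x)] + sum(ci for ci, xi in zip(c, x) if xi)
               for x in product((0, 1), repeat=len(c)))

# Hypothetical cardinality costs g[0..3] and per-node values c.
g = [5, 1, 0, 4]
c = [2, -1, 3]
fast = min_cardinality_factor(g, c)
slow = brute_force(g, c)
```

If only one entry of `c` changes between updates, keeping the values in a balanced search tree avoids re-sorting from scratch, which is the speedup suggested in the text.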

Singleton Separators Table 1 compares the four methods with singleton separators on all problems.
For problems of low order TRW-S always performs best, using less message effort than MSD
and MPLP. SG uses less message effort, but still has higher running times: handling and projecting
the gradients takes time, and one also has to compute minimizers along with the minimal values.
Figure 2 plots how the energies evolve w.r.t. message effort on stereo for the different methods.
For the high order terms TRW-S is beaten once, by SG on histogram segmentation. To make the
running times competitive one will need to use advanced data structures.

Pairwise Separators Experiments with pair separators are evaluated in Table 2, as mentioned only
for low-order problems. A plot for stereo is provided in Figure 2. Again, TRW-S is beaten once by
the subgradient method, this time for factor-based curvature. Possibly a different variable order
might boost TRW-S here. Otherwise TRW-S performs best. It always outperforms MSD and due to
the reuse scheme each iteration is also faster. For problems with a large number of pair separators
SG finally profits from its reduced message effort: for the Potts model it is clearly fastest after a
comparable number of iterations.

Table 2: Pair separators: experiments with low-order factors. We give relaxation values, timings,
message effort and memory consumption. A "*" indicates the same as above.

Figure 3: Data and results for second order stereo.




Figure 4: Data and results for curvature. From left to right: input image, result with constraint-based 
curvature, result with factor-based curvature and singleton separators and result with factor-based 
curvature and pair-separators. 






Figure 5: Data and results for the 2x2 Potts model. We show (near-identical) derived integral 
solutions with singleton and pair separators. 



5.1 Details on the Experiments 

For second order stereo we use triplet factors in both horizontal and vertical direction. Each factor 
has the form 

$$
\begin{cases}
0 & \text{if } |l_1 - l_2| \le 1 \text{ and } |l_2 - l_3| \le 1 \text{ and } |(l_1 - l_2) - (l_2 - l_3)| = 0 \\
\lambda & \text{if } |l_1 - l_2| \le 1 \text{ and } |l_2 - l_3| \le 1 \text{ and } |(l_1 - l_2) - (l_2 - l_3)| = 1 \\
3\lambda & \text{else}
\end{cases}
$$

with $\lambda = 15$. If there are only singleton separators we use a specialized message computation 
routine, otherwise a generic one. We run this on a downsized Tsukuba instance (half-scale, resulting 
in 8 disparities) shown in Figure 3. 
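
To make the case distinction concrete, here is a minimal sketch of the triplet cost as a plain function. It assumes the reading in which both first-difference thresholds are "at most 1" and the first case (zero second difference) costs 0; the function name `triplet_cost` and the constant `LAMBDA` are hypothetical and not taken from the authors' code.

```python
LAMBDA = 15  # weight reported in the text


def triplet_cost(l1, l2, l3, lam=LAMBDA):
    """Second-order smoothness cost for three consecutive disparities."""
    d12 = l1 - l2
    d23 = l2 - l3
    if abs(d12) <= 1 and abs(d23) <= 1:
        second_diff = abs(d12 - d23)
        if second_diff == 0:
            return 0  # assumed value of the first case
        if second_diff == 1:
            return lam
    # Large first differences, or a second difference of 2.
    return 3 * lam
```

For instance, a linear ramp such as `(1, 2, 3)` has zero second difference, whereas `(0, 1, 0)` has small first differences but a second difference of 2 and therefore falls into the last case.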

For both factor-based [4, 28] and constraint-based curvature [24] we use an 8-connectivity with 
squared differences in the data term, a curvature weight of 10000 and no length weight. We apply 
this to a 64 × 64 pixel version of the cameraman image (Figure 4). Our implementation is based on 
RegionCurv. 

The generalized Potts model is run on the lions image from Figure 5, where we use a block weight 
of 5000. 



https://github.com/PetterS/regioncurv 

Figure 6: Data and results for histogram image segmentation. The integrality gap is large, but more 
refined strategies to obtain integral solutions are conceivable. 

Histogram segmentation [33] is run on the sea star image in Figure 6 with the shown seed nodes 
and a prior weight of 2. 

For word alignment [23] we use 100 sentences from the Italian-English Europarl corpus. Note that 
this problem has a much more irregular structure than the computer vision problems. 

6 Conclusions 

We showed how to generalize the TRW-S algorithm from pairwise MRFs to arbitrary graphical 
models. In order to improve efficiency, we had to overcome several challenges: (i) Find a suitable 
definition of monotonic junction chains that depends only on the order of the nodes, and then extend 
this order to other factors in a consistent way. (ii) Make sure that parameters for the same factor in 
different chains stay the same (thus allowing an implementation via messages); we achieved this by 
passing messages only from outer factors. (iii) Find a way to reuse message computations in nested 
factors. 

TRW-S has shown good performance for pairwise graphical models [29, 30], and is among the 
state-of-the-art techniques for problems such as stereo [22]. It has also been shown that tightening the 
relaxation by adding higher-order constraints (e.g. short cycles) is an effective strategy for solving 
challenging instances [25, 2]. Our work makes it possible to combine the tightening strategy and the 
TRW-S technique; given the results in [22, 25, 2], it is reasonable to assume that this would yield a 
state-of-the-art method for some applications. 

In our experiments we pursued a different direction: applying generalized TRW-S directly to 
high-order graphical models. TRW-S outperformed MSD and MPLP on a number of applications. A 
notable exception is the word alignment problem, where MSD was faster. At times the subgradient 
method beats TRW-S, but it is also often heavily inferior and requires the tuning of a step-size 
parameter. 

Based on the above, we hope that generalized TRW-S will become one of the standard tools for 
MAP-MRF inference. Our implementation is available from [1]. 

One of the disadvantages of TRW-S is that it is not guaranteed to solve the LP: similarly to MSD and 
MPLP, it can get stuck in a suboptimal point. We see three ways to address this issue: (1) Interleave 
TRW-S and another technique that is guaranteed to solve the LP, e.g. subgradient ascent. (2) 
Instead of squeezing the last bit from the current LP relaxation, one can tighten the relaxation by 
adding higher-order constraints as in [25, 2] and run TRW-S again. (3) Use a smoothed version 
of TRW-S [22]. At the moment such a version has been presented only for the standard (pairwise) 
TRW-S, but we believe that generalizing it using our scheme should not be too difficult (we need to 
replace the max-product BP with the sum-product version). 
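
The remark about swapping max-product for sum-product can be illustrated in isolation. The sketch below is not the smoothed TRW-S algorithm; it only contrasts the hard minimum used in min-sum (max-product) updates with a temperature-controlled soft minimum of the kind that sum-product style smoothing typically relies on. The function names and the `temperature` parameter are assumptions made for this example.

```python
import math


def hard_min(values):
    """Minimum used by min-sum (max-product) message updates."""
    return min(values)


def soft_min(values, temperature):
    """Smooth surrogate: -t * log(sum(exp(-v / t))), computed stably."""
    m = min(values)
    # Shift by the minimum so that the largest exponent is exactly 0.
    s = sum(math.exp(-(v - m) / temperature) for v in values)
    return m - temperature * math.log(s)
```

The soft minimum never exceeds the hard minimum and approaches it as the temperature goes to zero, which is why it is a natural smooth replacement inside a message computation.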

Note that for some applications suboptimality of message passing techniques does not seem to be 
an issue: TRW either yields a global optimum or gets very close [37, 29]. 

Appendix A: proof of proposition 2.1 

Part (a) For each $x_C$ we can write 

$$\sum_{x_A \sim x_C} \mu_A(x_A) = \sum_{x_B \sim x_C} \sum_{x_A \sim x_B} \mu_A(x_A) \overset{(1)}{=} \sum_{x_B \sim x_C} \mu_B(x_B) \overset{(2)}{=} \mu_C(x_C)$$

where (1) holds since $(A, B) \in J$ and (2) holds since $(B, C) \in J$. Thus, the constraint for $(A, C)$ 
follows from the constraints for $(A, B)$, $(B, C)$. 

Part (b) For each $x_C$ we can write 

$$\sum_{x_B \sim x_C} \mu_B(x_B) \overset{(1)}{=} \sum_{x_B \sim x_C} \sum_{x_A \sim x_B} \mu_A(x_A) = \sum_{x_A \sim x_C} \mu_A(x_A) \overset{(2)}{=} \mu_C(x_C)$$

where (1) holds since $(A, B) \in J$ and (2) holds since $(A, C) \in J$. Thus, the constraint for $(B, C)$ 
follows from the constraints for $(A, B)$, $(A, C)$. 
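
The transitivity argument of part (a) can be checked numerically on a toy instance. The sketch below is only a sanity check under assumed data: three nested factors over binary nodes, a random vector for the largest factor, and a hypothetical helper `marginal` that sums a table down to a subset of its nodes.

```python
import itertools
import random


def marginal(mu, src, dst):
    """Sum mu (a dict over labelings of the nodes `src`) down to the nodes `dst`."""
    idx = [src.index(v) for v in dst]
    out = {}
    for x, p in mu.items():
        key = tuple(x[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


random.seed(0)
A, B, C = (0, 1, 2), (0, 1), (0,)  # nested factors, C inside B inside A
labelings = list(itertools.product([0, 1], repeat=len(A)))
weights = [random.random() for _ in labelings]
total = sum(weights)
mu_A = {x: w / total for x, w in zip(labelings, weights)}

# Enforce the constraints for (A, B) and (B, C) by construction.
mu_B = marginal(mu_A, A, B)
mu_C = marginal(mu_B, B, C)

# The constraint for (A, C) then holds as well, up to rounding.
direct = marginal(mu_A, A, C)
```

Comparing `direct` with `mu_C` entry by entry reproduces the chain of equalities in part (a) for this small example.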



Appendix B: proof of proposition 2.2 



Since set $J$ is closed under operations (4a)-(4b), we get 

• If $B \in \mathcal{F}_A$, $C \in \mathcal{F}_B$ then $C \in \mathcal{F}_A$. (25a) 

• If $B, C \in \mathcal{F}_A$, $B \supset C$ then $C \in \mathcal{F}_B$. (25b) 

First, let us prove the proposition assuming that $A \in \mathcal{O}_T$. Consider $B \in \mathcal{F}_T$, $B \subset A$. Pick a factor 
$A' \in \mathcal{O}_T$ with $B \in \mathcal{F}_{A'}$ (it exists by Assumption 1); if there are several such factors, pick one for 
which the distance from $A$ to $A'$ in the tree $(\mathcal{O}_T, \mathcal{E}_T)$ is minimal. We need to show that this distance 
is zero, i.e. $A = A'$. Suppose not; let $A'' \in \mathcal{O}_T$ be the neighbor of $A'$ (i.e. $(A', A'') \in \mathcal{E}_T$) which is 
closer to $A$ than $A'$. By the running intersection property, $A' \cap A \subseteq A''$, and so $B \subseteq A''$. Denote 
$S = A' \cap A''$; as we showed, $B \subseteq S$. By Assumption 3, $S \in \mathcal{F}_{A'}$ and $S \in \mathcal{F}_{A''}$. Since $B \in \mathcal{F}_{A'}$, 
we have $B \in \mathcal{F}_S$ by property (25b). Property (25a) and the fact $S \in \mathcal{F}_{A''}$ then give $B \in \mathcal{F}_{A''}$. This 
contradicts the choice of $A'$. 

It remains to prove the proposition in the case when $A \in \mathcal{F}_T - \mathcal{O}_T$. Pick a factor $A' \in \mathcal{O}_T$ with 
$A \in \mathcal{F}_{A'}$. As we showed above, we have $B \in \mathcal{F}_{A'}$, therefore property (25b) gives $B \in \mathcal{F}_A$. 



Appendix C: proof of proposition 3.1 

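We need to show that for each $B \in \mathcal{F}_T \cap \mathcal{S}$ there exists $A \in \mathcal{O}_T$ with $B \in \mathcal{S}_A$. Assume that 
$|\mathcal{O}_T| \ge 2$ and thus $|A| \ge 2$ for all $A \in \mathcal{O}_T$, otherwise the claim is trivial. 

Let $A_1, \ldots, A_k$ be the sequence of factors in $\mathcal{O}_T$. Denote $u_i = \min A_i$, $v_i = \max A_i$ (w.r.t. $\le$). We 
have $u_i = \min \mathrm{sep}^- A_i \le \min \mathrm{sep}^+ A_i$ and $\max \mathrm{sep}^- A_i \le \max \mathrm{sep}^+ A_i = v_i$; since $\preceq$ extends 
$\le$, we get $\mathrm{sep}^- A_i \prec \mathrm{sep}^+ A_i$. Therefore, 

$$\{u_1\} = \mathrm{sep}^- A_1 \prec \mathrm{sep}^+ A_1 = \mathrm{sep}^- A_2 \prec \ldots \prec \mathrm{sep}^+ A_{k-1} = \mathrm{sep}^- A_k \prec \mathrm{sep}^+ A_k = \{v_k\} \qquad (26)$$

We also have 

$$\{u_1\} = \min (\mathcal{F}_T \cap \mathcal{S}), \qquad \{v_k\} = \max (\mathcal{F}_T \cap \mathcal{S}) \qquad (27)$$

where min, max are taken w.r.t. $\preceq$. Indeed, for each $B \in \mathcal{F}_T \cap \mathcal{S}$ there exists $A_i \in \mathcal{O}_T$ with 
$B \in \mathcal{F}_{A_i}$ (by Assumption 1); using Assumption 6 and the fact that $\preceq$ extends $\le$, we get 
$\{u_1\} \preceq \{u_2\} \preceq \ldots \preceq \{u_i\} \preceq B$. The second equation in (27) is proved in a similar way. 

Consider $B \in \mathcal{F}_T \cap \mathcal{S}$. Equations (26), (27) imply that there exists at least one factor $A_i \in \mathcal{O}_T$ 
with $\mathrm{sep}^- A_i \preceq B \preceq \mathrm{sep}^+ A_i$. It remains to show that $B \subseteq A_i$; then we will have $B \in \mathcal{F}_{A_i}$ by 
proposition 2.2, implying $B \in \mathcal{S}_{A_i}$. 

Consider node $v \in B$. There holds $\{u_i\} \preceq \mathrm{sep}^- A_i \preceq B$, and therefore $u_i \le v$. Similarly, $v \le v_i$. 
Monotonicity assumption 6 then implies that $v \in A_i$. The claim is proved. 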



Appendix D: proof of theorem 4.2 



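Averaging $B$ does not affect parameters in trees $T \notin \mathcal{T}_B$, so for the purpose of the proof we can 
assume w.l.o.g. that $\mathcal{T} = \mathcal{T}_B$. Furthermore, we can assume that $\min_x f(x \mid \theta^T) = 0$ for each $T \in \mathcal{T}_B$ 
(this can be achieved by adding a constant to $\theta^T$; clearly, this does not affect the theorem's claims). 
We thus have $\Phi(\theta) = 0$. 

By construction, before the averaging $\nu^T_B$ gives correct min-marginals for $B$ in $T$ (eq. …). It is 
easy to see that the same holds after the averaging, i.e. $\tilde\nu^T_B$ gives correct min-marginals for function 
$f(\cdot \mid \tilde\theta^T)$. (This is because in the definition of min-marginals we fix labeling $x_B$, and vectors $\theta^T$, $\tilde\theta^T$ 
differ only in components $\theta^T_B(x_B)$.) 

We thus have $\min_{x_B} \nu^T_B(x_B) = \min_x f(x \mid \theta^T) = 0$. Inspecting update (12), we conclude that 
$\tilde\nu^T_B(x_B) \ge 0$ for each $x_B$. This gives part (a): 

$$\Phi(\tilde\theta) = \sum_{T \in \mathcal{T}} \rho_T \min_x f(x \mid \tilde\theta^T) = \sum_{T \in \mathcal{T}} \rho_T \min_{x_B} \tilde\nu^T_B(x_B) \ge 0$$

To prove part (b), suppose that $\theta$ satisfies RWTA for $B$ with relations $(\mathcal{R}^T \mid T \in \mathcal{T})$. We need to 
show that $\mathcal{R}^T \subseteq \langle \tilde\theta^T \rangle$ for each $T \in \mathcal{T}_B$. Consider labeling $x \in \mathcal{R}^T$. For each $T' \in \mathcal{T}_B$ we have 
$x_B \in \pi_B(\mathcal{R}^T) = \pi_B(\mathcal{R}^{T'})$, therefore $\exists x' \in \langle \theta^{T'} \rangle$ with $x'_B = x_B$. Since $\nu^{T'}_B$ gives correct 
min-marginals for $B$ in tree $T'$, we conclude that $\nu^{T'}_B(x_B) = 0$. This implies that $\tilde\nu^T_B(x_B) = 0$ (see 
eq. (12)). This implies that $x_B \in \langle \tilde\nu^T_B \rangle$ and thus $x \in \langle \tilde\theta^T \rangle$. 

It remains to prove parts (c,d). We assume from now on that the bound does not change: $\Phi(\tilde\theta) = 
\Phi(\theta)$; thus, $\min_{x_B} \tilde\nu^T_B(x_B) = 0$ for $T \in \mathcal{T}$. 

Let us fix tree $T \in \mathcal{T}_B$, and let $x$ be a labeling in $\langle \tilde\theta^T \rangle$, so $x_B \in \langle \tilde\nu^T_B \rangle$. We have $\tilde\nu^T_B(x_B) = 0$; 
inspecting update (12), we conclude that $\nu^{T'}_B(x_B) = 0$ for all $T' \in \mathcal{T}_B$, and so $x_B \in \langle \nu^T_B \rangle$ and 
$x \in \langle \theta^T \rangle$. This proves that $\langle \tilde\theta^T \rangle \subseteq \langle \theta^T \rangle$. 

Now assume that WTA for $B$ does not hold. This means that there exist trees $T, T' \in \mathcal{T}_B$ and 
labeling $x \in \langle \theta^T \rangle$ such that $x_B \notin \langle \nu^{T'}_B \rangle$. The latter condition means that $\nu^{T'}_B(x_B) > 0$, and therefore 
$\tilde\nu^T_B(x_B) > 0$, $x_B \notin \langle \tilde\nu^T_B \rangle$ and $x \notin \langle \tilde\theta^T \rangle$. Thus, $\langle \tilde\theta^T \rangle$ is a strict subset of $\langle \theta^T \rangle$. 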



Appendix E: proof of theorem 4.5 



Constructing mapping $\phi : \Omega^* \to \Lambda^*$ The construction will be based on the following lemma. 

Lemma 6.1. Consider tree $T \in \mathcal{T}$ and non-empty relation $\mathcal{R}^T \subseteq \langle \theta^T \rangle$. Vector $\theta^T$ can be 
reparameterized in such a way that it satisfies 

(a) $\theta^T_B(x_B) = 0$ for each $B \in \mathcal{F}_T \cap \mathcal{S}$ and each $x_B$; 

(b) $\pi_A(\mathcal{R}^T) \subseteq \langle \theta^T_A \rangle$ for each $A \in \mathcal{O}_T$; 

(c) $\min_x f(x \mid \theta^T) = \sum_{A \in \mathcal{O}_T} \min_{x_A} \theta^T_A(x_A)$. 



Proof. We use induction on the size of the tree. If $\mathcal{O}_T = \{A\}$ then the claim is straightforward: 
for each $B \in \mathcal{F}_A$ we just need to "move" parameter $\theta^T_B$ to the outer factor $A \in \mathcal{O}_T$, i.e. update 

$$\theta^T_A(x_A) \mathrel{+}= \theta^T_B(x_B), \qquad \theta^T_B(x_B) := 0.$$
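
The "move" operation is a reparameterization: it changes the individual vectors but not the function they define. A minimal sketch, assuming dictionary-based tables over binary labels and a single nested factor $B$ strictly inside $A$ (the names `move_to_outer` and `energy` are hypothetical):

```python
import itertools


def move_to_outer(theta_A, theta_B, A, B):
    """Add theta_B into theta_A (B is a subset of A) and zero out theta_B."""
    idx = [A.index(v) for v in B]  # positions of B's nodes inside A
    new_A = {x_A: val + theta_B[tuple(x_A[i] for i in idx)]
             for x_A, val in theta_A.items()}
    new_B = {x_B: 0.0 for x_B in theta_B}
    return new_A, new_B


def energy(theta_A, theta_B, A, B, x_A):
    """Value of the two-factor function at the labeling x_A of A."""
    idx = [A.index(v) for v in B]
    return theta_A[x_A] + theta_B[tuple(x_A[i] for i in idx)]


# Toy instance (assumed): outer factor A = (0, 1), nested factor B = (1,).
A, B = (0, 1), (1,)
theta_A = {x: float(2 * x[0] + 3 * x[1])
           for x in itertools.product([0, 1], repeat=2)}
theta_B = {(0,): 5.0, (1,): -1.0}
new_A, new_B = move_to_outer(theta_A, theta_B, A, B)
```

After the update the nested table is identically zero, while every labeling of $A$ keeps the same total cost.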

Now consider the induction step; suppose that $|\mathcal{O}_T| \ge 2$. Let us pick a leaf factor $A \in \mathcal{O}_T$ and do 
the following. First, reparameterize $\theta^T$ so that $\theta^T_A$ gives correct min-marginals for $A$ in $T$ (eq. …). 
Second, for each $B \in \mathcal{F}_A$ "move" all parameters $\theta^T_B$ to $A$, as above. Now consider tree $T'$ obtained 
from $T$ by removing factor $A$. Vector $\theta^{T'}$ is obtained from $\theta^T$ by setting $\theta_A := 0$. Using the 
fact that $\nu^T_A$ gives correct min-marginals for $A$, we conclude that $\mathcal{R}^{T'} \subseteq \langle \theta^{T'} \rangle$ where we defined 
$\mathcal{R}^{T'} = \mathcal{R}^T$. Let us now reparameterize $\theta^{T'}$ (together with $\theta^T$) using the induction hypothesis. 

By construction, the obtained reparameterization $\theta^T$ satisfies (a). Since $\theta^T_A = \nu^T_A$ gives correct 
min-marginals for factor $A$, we have $\pi_A(\langle \theta^T \rangle) = \langle \theta^T_A \rangle$, so property (b) holds for factor $A$. For other 
factors in $\mathcal{O}_T - \{A\}$ property (b) holds by the induction hypothesis. 

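To prove (c), we first observe that 

$$\min_x f(x \mid \theta^{T'}) \overset{(1)}{=} f(x^* \mid \theta^T) - \theta^T_A(x^*_A) \overset{(2)}{=} \nu^T_A(x^*_A) - \theta^T_A(x^*_A) = 0$$

where $x^*$ is a labeling in $\mathcal{R}^{T'} = \mathcal{R}^T$; (1) holds by the induction hypothesis and (2) holds since by 
construction $\nu^T_A$ gives correct min-marginals in $T$. We can now write 

$$\min_x f(x \mid \theta^T) = \min_{x_A} \nu^T_A(x_A) = \min_{x_A} \theta^T_A(x_A) = \sum_{A' \in \mathcal{O}_T} \min_{x_{A'}} \theta^T_{A'}(x_{A'})$$

□ 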

We can now construct mapping $\phi : \Omega^* \to \Lambda^*$. Consider vector $\theta \in \Omega^*$ that satisfies WTA with 
relations $(\mathcal{R}^T \mid T \in \mathcal{T})$. Let us reparameterize each vector $\theta^T$ as described in lemma 6.1. Clearly, 
this operation does not affect $\Phi(\theta)$, and $\theta$ still satisfies WTA with relations $(\mathcal{R}^T \mid T \in \mathcal{T})$. The result 
of mapping $\phi$ is now defined as $\bar\theta = \sum_{T \in \mathcal{T}} \rho_T \theta^T$. For each $B \in \mathcal{F}$ define $\mathcal{R}_B = \pi_B(\mathcal{R}^T)$ where 
$T \in \mathcal{T}_B$. (Note, $\mathcal{R}_B$ does not depend on which $T$ is chosen, since WTA holds; see definition 4.1.) 
It is easy to see that $\bar\theta$ satisfies the relaxed J-consistency condition with relations $(\mathcal{R}_B \mid B \in \mathcal{F})$. We 
also have 

TeT TeTAeOT 
= V mine'A(a;A) = V mmOAixA) = ^(f) 

Constructing mapping $\psi : \Lambda^* \to \Omega^*$ Consider vector $\theta \in \Lambda^*$ that satisfies the J-consistency 
condition with relations $(\mathcal{R}_B \mid B \in \mathcal{F})$. The argument used in the proof of proposition 2.1 implies 
that the J-consistency also holds. 

First, let us do the following: for each $B \in \mathcal{S}$ pick outer factor $A \in \mathcal{O}$ with $B \in \mathcal{F}_A$ and "move" 
vector $\theta_B$ to $A$, i.e. update $\theta_A(x_A) \mathrel{+}= \theta_B(x_B)$, $\theta_B(x_B) := 0$. 

Lemma 6.2. The update above does not affect the value of the bound, and $\theta$ still satisfies the relaxed 
J-consistency condition with relations $(\mathcal{R}_B \mid B \in \mathcal{F})$. 

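Proof. Let $\theta$ and $\tilde\theta$ be the vectors before and after the update for factors $A \in \mathcal{O}$, $B \in \mathcal{F}_A - \{A\}$, 
respectively. Consider labeling $x_A \in \mathcal{R}_A \subseteq \langle \theta_A \rangle$. Note that $x_B \in \pi_B(\mathcal{R}_A) = \mathcal{R}_B \subseteq \langle \theta_B \rangle$. To 
prove the second claim, we need to show that $x_A \in \langle \tilde\theta_A \rangle$. This holds since for any other labeling $x'_A$ 

$$\tilde\theta_A(x_A) = \theta_A(x_A) + \theta_B(x_B) \le \theta_A(x'_A) + \theta_B(x'_B) = \tilde\theta_A(x'_A)$$

The first part holds since 

$$\min_{x'_A} \theta_A(x'_A) + \min_{x'_B} \theta_B(x'_B) = \theta_A(x_A) + \theta_B(x_B) = \tilde\theta_A(x_A) + \tilde\theta_B(x_B) = \min_{x'_A} \tilde\theta_A(x'_A) + \min_{x'_B} \tilde\theta_B(x'_B)$$

□ 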

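We now have vector $\theta$ with $\theta_B(x_B) = 0$ for all $B \in \mathcal{S}$. 

Lemma 6.3. Consider tree $T = (\mathcal{O}_T, \mathcal{E}_T)$. Define vector $\theta^T$ as follows: $\theta^T_A = \frac{1}{\rho_A} \theta_A$ for $A \in \mathcal{O}_T$ 
and $\theta^T_B(x_B) = 0$ for $B \in \mathcal{S}$, $T \in \mathcal{T}_B$. Define relation 

$$\mathcal{R}^T = \{ x \mid x_A \in \mathcal{R}_A \;\; \forall A \in \mathcal{O}_T \} \qquad (28)$$

(a) $\pi_B(\mathcal{R}^T) = \mathcal{R}_B$ for each $B \in \mathcal{F}_T$. 

(b) $f(x^T \mid \theta^T) = \sum_{A \in \mathcal{O}_T} \min_{x_A} \theta^T_A(x_A)$ for each $x^T \in \mathcal{R}^T$. 

(c) $\mathcal{R}^T \subseteq \langle \theta^T \rangle$. 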

Proof. It suffices to show that $\pi_A(\mathcal{R}^T) = \mathcal{R}_A$ for each $A \in \mathcal{O}_T$; for $B \in \mathcal{F}_A - \{A\}$ we will then 
have $\pi_B(\mathcal{R}^T) = \pi_B(\mathcal{R}_A) = \mathcal{R}_B$, where the last equality holds since $(A, B) \in J$ and $\theta$ satisfies the 
J-consistency condition with relations $(\mathcal{R}_B \mid B \in \mathcal{F})$. 

We use induction on the size of the tree. For $\mathcal{O}_T = \{A\}$ the claim is obvious; suppose that $|\mathcal{O}_T| \ge 2$. 
Pick a leaf factor $A \in \mathcal{O}_T$, with $(A, \hat{A}) \in \mathcal{E}_T$. Let $T'$ be the tree obtained from $T$ by removing 
factor $A$, and $S = A \cap \hat{A} \in \mathcal{F}_{T'}$. We assume that … . By the running intersection property, 
$(A - S) \cap A' = \varnothing$ for $A' \in \mathcal{O}_T - \{A\}$. 

Let $x'$ be a labeling in $\mathcal{R}^{T'}$. By the induction hypothesis $x'_S \in \mathcal{R}_S$. Let $x_A$ be a labeling in $\mathcal{R}_A$ with 
$x_S = x'_S$ (it exists since $(A, S) \in J$ and J-consistency holds). Let $x$ be the labeling obtained from 
$x'$ by changing the labeling of $A - S$ from $x'_{A-S}$ to $x_{A-S}$. Clearly, $x \in \mathcal{R}^T$. 

The argument above and the induction hypothesis show that $\mathcal{R}_{A'} \subseteq \pi_{A'}(\mathcal{R}^T)$ for each $A' \in \mathcal{O}_T - 
\{A\}$. The fact that $\mathcal{R}_A \subseteq \pi_A(\mathcal{R}^T)$ is also clear (in the argument above we can first choose $x_A \in \mathcal{R}_A$, 
and then $x' \in \mathcal{R}^{T'}$ which is consistent with $x_A$ on $S$). The inclusion $\pi_{A'}(\mathcal{R}^T) \subseteq \mathcal{R}_{A'}$ for $A' \in \mathcal{O}_T$ 
follows from the definition of $\mathcal{R}^T$. This proves part (a). Part (b) is also easy to prove: for each 
$x^T \in \mathcal{R}^T$ we have 

$$f(x^T \mid \theta^T) = \sum_{A' \in \mathcal{O}_T} \theta^T_{A'}(x^T_{A'}) \overset{(1)}{=} \sum_{A' \in \mathcal{O}_T} \min_{x_{A'}} \theta^T_{A'}(x_{A'})$$

where (1) holds by the induction hypothesis and the fact that $x^T_A \in \mathcal{R}_A$. Finally, part (c) follows 
from (b) and the fact that $\sum_{A \in \mathcal{O}_T} \min_{x_A} \theta^T_A(x_A)$ is a lower bound on $\min_x f(x \mid \theta^T)$. □ 

The result of mapping $\psi$ is now defined as described in the lemma. It is easy to see that the obtained 
vector satisfies WTA with relations $(\mathcal{R}^T \mid T \in \mathcal{T})$ from the lemma. We also have 

where (1) follows from lemma 6.3(b,c). 